Enhancing Information Retrieval Through Statistical Natural Language Processing: A Study of Collocation Indexing

Authors

  • Ofer Arazy
  • Carson C. Woo
Abstract

Although the management of information assets (specifically, of the text documents that make up 80 percent of these assets) can provide organizations with a competitive advantage, the ability of information retrieval (IR) systems to deliver relevant information to users is severely hampered by the difficulty of disambiguating natural language. The word ambiguity problem is addressed with moderate success in restricted settings, but it continues to be the main challenge in general settings, which are characterized by large, heterogeneous document collections. In this paper, we provide preliminary evidence for the usefulness of statistical natural language processing (NLP) techniques, and specifically of collocation indexing, for IR in general settings. We investigate the effect of three key parameters on collocation indexing performance: directionality, distance, and weighting. We build on previous work in IR to (1) advance our knowledge of key design elements for collocation indexing, (2) demonstrate gains in retrieval precision from the use of statistical NLP for general-settings IR, and (3) provide practitioners with a useful cost-benefit analysis of the methods under investigation.

Veda Storey was the accepting senior editor for this paper. Praveen Pathak served as a reviewer. The associate editor and two additional reviewers chose to remain anonymous.
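
As a rough illustration of what the three parameters named in the abstract could mean in practice, the Python sketch below counts word pairs that co-occur within a token window. It is not the authors' method: the function name `collocation_index`, its parameters, and the 1/gap weighting are all hypothetical choices made for this example, since the abstract does not say how directionality, distance, or weighting are defined.

```python
from collections import Counter


def collocation_index(tokens, max_distance=2, directional=True,
                      distance_weighted=False):
    """Count word pairs that co-occur within max_distance tokens.

    Illustrative assumptions (not taken from the paper):
    - distance: the pair's gap in tokens must be <= max_distance
    - directionality: if True, (a, b) and (b, a) are separate entries
    - weighting: if distance_weighted, each occurrence adds 1 / gap
    """
    weights = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so each pair of positions is visited once.
        for j in range(i + 1, min(i + max_distance + 1, len(tokens))):
            right = tokens[j]
            if directional:
                key = (left, right)
            else:
                # Collapse both word orders into one sorted key.
                key = tuple(sorted((left, right)))
            weights[key] += 1.0 / (j - i) if distance_weighted else 1.0
    return weights


tokens = "information retrieval systems retrieval information".split()
directed = collocation_index(tokens, max_distance=1, directional=True)
undirected = collocation_index(tokens, max_distance=1, directional=False)
```

With these settings, `directed` keeps ("information", "retrieval") and ("retrieval", "information") as separate entries, while `undirected` merges them into a single key. Raising `max_distance` admits looser pairs, and `distance_weighted=True` discounts them relative to adjacent ones.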

Similar resources

Statistical Identification of Collocations in Large Corpora for Information Retrieval

The linguistic phenomenon of collocation, the habitual juxtaposition of some words in natural language, has been shown to benefit natural language processing tasks such as information retrieval. This paper examines the utility of several methods for collocation extraction for document retrieval, specifically for queries in question form.

Reflections of Accomplishments in Natural Language Based Detection and Summarization

The common tie among these lines of research is that natural language processing techniques offer a way of overcoming the weaknesses inherent to purely statistical approaches. GE pioneered the large-scale use of natural language processing techniques in information retrieval. Standard statistical search methods use words, word fragments, and simple collocations to index documents. The GE work i...

The Application of Fuzzy Logic to Collocation Extraction

Collocations are important for many natural language processing tasks, such as information retrieval, machine translation, and computational lexicography. So far, many statistical methods have been used for collocation extraction. Almost all of these methods form a classical crisp set of collocations. We propose a fuzzy logic approach to collocation extraction to form a fuzzy set of collocations in ...

Indexing Audio Documents by using Latent Semantic Analysis and SOM

This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing, and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated, and tested. The idea is to extract extra information from the structure of the document collection an...

Enhancing Detection through Linguistic Indexing and Topic Expansion

Natural language processing techniques may hold a tremendous potential for overcoming the inadequacies of purely quantitative methods of text information retrieval. Under the Tipster contracts in phases I through III, GE group has set out to explore this potential through development and evaluation of new text processing techniques. This work resulted in some significant advances and in a bette...

Journal:
  • MIS Quarterly

Volume 31

Publication date: 2007